MongoDB is one of a number of schemaless NoSQL databases which have become popular for big data and for areas where the strict schemas of a SQL database don't always fit. PyMongo is a Python distribution which allows us to work with MongoDB from Python. You need to download and install both PyMongo and MongoDB. Before following the tutorial below you must have a mongod instance running: on the command line, cd to where MongoDB was installed, cd to the bin folder and type mongod. For me, on Windows, this meant changing to C:\Program Files\MongoDB\Server\3.0\bin and then typing mongod.
In [1]:
from pymongo import MongoClient
client = MongoClient() #connects to the mongod instance running on the default host and port
users = client.test_database.user #database "test_database" and collection "user" are created on the first insert if they do not already exist
In [2]:
users.delete_many({}) #Making sure that the collection is empty before I start (remove() is deprecated in PyMongo 3)
Out[2]:
In SQL, databases hold tables, tables contain rows and each row is made up of a number of columns. In NoSQL databases such as MongoDB, databases hold collections, collections hold documents and each document is made up of a number of fields. Documents are the NoSQL version of rows. They are JSON-like (MongoDB stores them in a binary form called BSON), so we can write them as Python dictionaries. Above we are using the database "test_database" and the collection "user". This is the collection we will insert our documents into.
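Because documents are JSON-like, a plain Python dictionary can stand in for one. The sketch below is illustrative only: the field names and values are made up, nothing is sent to MongoDB, and it simply round-trips a dictionary through JSON text to show that both forms carry the same data.

```python
import json

# A document is a set of field/value pairs, so a dict models it directly.
# These fields and values are invented for illustration.
document = {
    "name": "Alice",
    "location": "France",
    "interests": ["Python", "MongoDB"],
    "address": {"city": "Paris", "postcode": "75001"},  # an embedded document
}

# Serialise to JSON text and parse it back again.
as_json = json.dumps(document, sort_keys=True)
restored = json.loads(as_json)

print(as_json)
print(restored == document)
```

Lists and nested dictionaries survive the round trip, which is why fields can hold lists or other documents as well as simple values.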
In [3]:
user = {"name" : "Bob",
        "location" : "Ireland",
        "interests" : ["Java", "python"]}
users.insert_one(user) #insert user into the users collection as a document. The older insert() also works in PyMongo 3 but is deprecated.
Out[3]:
In [4]:
users.find() #find() returns a cursor
Out[4]:
In [5]:
users.find_one()
Out[5]:
Above we created a document with name, location and interests fields. These fields hold strings or a list of strings, but fields can also hold ints, floats, other documents and many other types. In the JavaScript mongo shell, typing find() prints the documents in a collection, but in Python it returns a cursor which we iterate over to get the documents. find_one() (findOne() in JavaScript) returns the first document from the collection, or None if there is no match.
Below we add another user and use the cursor to iterate over all the users and print them. This user has fewer fields than our first user: location is not declared. A SQL table would need a value (or NULL) for every column of every row, but leaving a field out is perfectly acceptable in NoSQL.
In [6]:
user = {"name" : "Ted",
"interests" : ["data science", "R"]} #Note that this document does not contain the same fields (columns) as the one above
users.insert_one(user)
Out[6]:
In [7]:
for user in users.find(): #use the cursor to return all documents in the collection
    print(user) #the order the fields will be printed in can't be guaranteed
insert_many() can be used for bulk inserts. Again note that the first user has an extra field not declared by any of the other users.
In [8]:
#Insert many users at once
new_users = [{"name" : "Mike",
"occupation" : "Data Scientist",
"interests" : ["data science", "machine learning", "python", "R"]},
{"name" : "Elliot",
"interests" : ["programming"]}]
users.insert_many(new_users)
Out[8]:
In [9]:
for user in users.find():
    print(user["name"]) #print out just the user's name
While we know that each user has included their name, if we try to read a field that is not present in every document, such as the occupation field, Python raises a KeyError for the documents that lack it. Code such as that below should be used to prevent the error.
In [10]:
#printing the occupation for all users raises a KeyError because only one document contains this field
for user in users.find():
    try:
        print(user["occupation"])
    except KeyError: #skip documents that have no occupation field
        pass
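Since each document comes back as a dictionary, another option is dict.get(), which returns a default instead of raising when a field is absent. This is a minimal sketch that uses hand-written stand-in documents rather than a live collection; occupation_of is a hypothetical helper name.

```python
# Stand-in documents shaped like the ones in this tutorial; no database is needed.
documents = [
    {"name": "Mike", "occupation": "Data Scientist"},
    {"name": "Elliot"},
]

def occupation_of(document, default="unknown"):
    """Return the occupation field, or a default when the field is absent."""
    return document.get("occupation", default)

for document in documents:
    print("%s: %s" % (document["name"], occupation_of(document)))
```

With a real collection the loop would run over users.find() instead of the documents list.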
In [11]:
#find() with a filter returns a cursor over the matching documents, not the documents themselves
print(users.find({"name" : "Bob"})) #using find_one() instead would return the first user found whose name is Bob.
Bob has gotten a job and wants to update his profile. users.find_one({"name" : "Bob"}) returns the first user whose name is Bob. As with a dictionary, we add an occupation field, set its value to "programmer" and then update the first returned Bob with the new document. Note that the ObjectId returned below is identical to the ObjectId returned the first time we inserted Bob into the collection. The ObjectId is the unique identifier of a document (the equivalent of a row) and this shows that we have updated the existing document, not made a new one.
In [12]:
update_user = users.find_one({"name" : "Bob"})
update_user["occupation"] = "programmer"
users.replace_one({"name" : "Bob"}, update_user) #replace the first matching document; the older update() is deprecated
print(users.find_one({"name" : "Bob"}))
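Writing back the whole document works, but MongoDB also has a $set update operator that changes only the named fields and leaves the rest alone. This is a minimal sketch, assuming PyMongo 3.0 or later where update_one() is available; set_field_update is a hypothetical helper that only builds the update document, so the block runs without a database.

```python
def set_field_update(field, value):
    """Build an update document that sets one field and leaves the others untouched."""
    return {"$set": {field: value}}

update = set_field_update("occupation", "programmer")
print(update)

# With a live collection this would be passed to update_one, for example:
#   users.update_one({"name": "Bob"}, update)
```

This avoids fetching the document first, and it cannot accidentally overwrite fields that another writer changed in the meantime.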
It turns out that "Elliot" is actually a bot account made for the purpose of spamming the other users. delete_many({"name" : "Elliot"}) will remove all Elliots in our collection, which is ok as we only have one. For larger collections all removing or updating should be done on the ObjectId. When we use find_one() afterwards nothing is returned (find_one() gives None), as there is now no user with the name Elliot in the collection.
In [13]:
users.delete_many({"name" : "Elliot"}) #the older remove() is deprecated in PyMongo 3
users.find_one({"name" : "Elliot"})
An example of how to delete data from a user is shown below. We pull Mike's data out of the collection and then pop off the occupation field. At this point only our local copy has changed; the collection itself is updated further down.
In [14]:
update_user = users.find_one({"name" : "Mike"})
update_user.pop("occupation")
print(update_user)
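Popping a key only changes the local dictionary. To remove a field on the server in a single step, MongoDB provides the $unset update operator. This is a sketch under the same assumptions as before (PyMongo 3.0 or later for update_one()); unset_field_update is a hypothetical helper that just builds the update document, so no database is needed to run it.

```python
def unset_field_update(field):
    """Build an update document that removes one field from a document."""
    # The value given to $unset is ignored by MongoDB; an empty string is conventional.
    return {"$unset": {field: ""}}

update = unset_field_update("occupation")
print(update)

# With a live collection this would be passed to update_one, for example:
#   users.update_one({"name": "Mike"}, update)
```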
Here we print off the ObjectId for Mike. Printing shows it as a 24-character hexadecimal string, and a plain string like this can not be used to search the collection. To search with it we must convert it back into an ObjectId by importing ObjectId from bson.objectid (BSON stands for binary JSON). Below we get Mike's ObjectId as a string, convert it and search the collection for his document, update his document and then search for the new entry, all using his unique ObjectId.
In [15]:
print(update_user["_id"])
In [16]:
from bson.objectid import ObjectId
id_as_string = str(update_user["_id"]) #str() gives the 24-character hex form of the ObjectId
print(users.find_one({"_id" : ObjectId(id_as_string)}))
users.replace_one({"_id" : ObjectId(id_as_string)}, update_user) #write back the document without the occupation field
print("")
print(users.find_one({"_id" : ObjectId(id_as_string)}))
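ObjectId() raises bson.errors.InvalidId when given a string that is not a valid id, so it can be worth checking strings that come from outside the program first. The bson package offers ObjectId.is_valid() for this. The sketch below is a rough pure-Python approximation of that check so it can run without PyMongo installed; looks_like_object_id is a hypothetical helper, and it relies only on the fact that an ObjectId is 12 bytes, written as 24 hexadecimal characters.

```python
import re

# An ObjectId is 12 bytes, which is written as exactly 24 hexadecimal characters.
OBJECT_ID_PATTERN = re.compile(r"[0-9a-fA-F]{24}\Z")

def looks_like_object_id(text):
    """Return True if text has the shape of an ObjectId hex string."""
    return bool(OBJECT_ID_PATTERN.match(text))

print(looks_like_object_id("507f1f77bcf86cd799439011"))
print(looks_like_object_id("Mike"))
```

In real code, ObjectId.is_valid(id_as_string) from bson.objectid is the better choice when PyMongo is available.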
PyMongo allows us to leverage other great Python modules such as pandas. Here we import pandas, convert the cursor returned by find() into a list and then turn that into a new DataFrame. Fields missing from a document appear as NaN and we can use pandas fillna() to replace these with suitable values.
In [17]:
import pandas as pd
df = pd.DataFrame(list(users.find()))
print(df.head())
In [19]:
df = df.fillna("None")
print(df)
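To see what filling in the gaps amounts to, here is a rough pure-Python equivalent that needs neither pandas nor a database. It is only a sketch of the idea, not how pandas works internally; fill_missing_fields is a hypothetical helper and the documents are hand-written stand-ins.

```python
def fill_missing_fields(documents, default="None"):
    """Give every document the same set of fields, filling gaps with a default."""
    # Collect every field name that appears in any document.
    all_fields = sorted({field for document in documents for field in document})
    return [
        {field: document.get(field, default) for field in all_fields}
        for document in documents
    ]

rows = fill_missing_fields([
    {"name": "Bob", "location": "Ireland"},
    {"name": "Ted"},
])
for row in rows:
    print(row)
```

Each row ends up with the same columns, which is the shape a DataFrame needs.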